Every two years, the Centers for Disease Control and Prevention (CDC) conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, collecting data from high schoolers (9th through 12th grade) to analyze health patterns. We will work with a selected group of variables from a random sample of observations from one of the years the YRBSS was conducted.
This dataset comes with the openintro package, so we can load and inspect it directly. There are observations on 13 different variables, some categorical and some numerical.
data(yrbss)
glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.4, 55.8, 46.7, 67.1, 131.5, 71.2, …
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
Before we carry on with our analysis, it is always a good idea to run skimr::skim() to get a feel for missing values, summary statistics of numerical variables, and a very rough histogram.
#use skimr::skim() to get feel for missing values
skimr::skim(yrbss)
| Name | yrbss |
| Number of rows | 13583 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 12 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| grade | 79 | 0.99 | 1 | 5 | 0 | 5 | 0 |
| hispanic | 231 | 0.98 | 3 | 8 | 0 | 2 | 0 |
| race | 2805 | 0.79 | 5 | 41 | 0 | 5 | 0 |
| helmet_12m | 311 | 0.98 | 5 | 12 | 0 | 6 | 0 |
| text_while_driving_30d | 918 | 0.93 | 1 | 13 | 0 | 8 | 0 |
| hours_tv_per_school_day | 338 | 0.98 | 1 | 12 | 0 | 7 | 0 |
| school_night_hours_sleep | 1248 | 0.91 | 1 | 3 | 0 | 7 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 77 | 0.99 | 16.16 | 1.26 | 12.00 | 15.0 | 16.00 | 17.00 | 18.00 | ▁▂▅▅▇ |
| height | 1004 | 0.93 | 1.69 | 0.10 | 1.27 | 1.6 | 1.68 | 1.78 | 2.11 | ▁▅▇▃▁ |
| weight | 1004 | 0.93 | 67.91 | 16.90 | 29.94 | 56.2 | 64.41 | 76.20 | 180.99 | ▆▇▂▁▁ |
| physically_active_7d | 273 | 0.98 | 3.90 | 2.56 | 0.00 | 2.0 | 4.00 | 7.00 | 7.00 | ▆▂▅▃▇ |
| strength_training_7d | 1176 | 0.91 | 2.95 | 2.58 | 0.00 | 0.0 | 3.00 | 5.00 | 7.00 | ▇▂▅▂▅ |
From the output above, we can see that our dataset contains 13,583 observations on 13 variables, of which 8 are character and 5 are numeric. Many of these variables have missing values; the variable with the most is race, with 2,805 missing values.
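These missing-value counts can also be cross-checked quickly with base R's colSums(is.na(...)); the toy data frame below is made up purely for illustration, but the same call works on yrbss:

```r
# Count missing values per column with base R
# (toy data frame for illustration; colSums(is.na(yrbss)) works the same way)
toy <- data.frame(
  height = c(1.60, NA, 1.75, NA),
  weight = c(55.8, 64.4, NA, 70.1),
  grade  = c("9", "10", "10", NA)
)

na_counts <- colSums(is.na(toy))            # named vector of n_missing per column
complete_rate <- 1 - na_counts / nrow(toy)  # skim's complete_rate
na_counts
```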
We will start by analyzing the weight of participants in kilograms. From the output above, we can already see that the weight variable has 1,004 missing values. We will first look at the overall distribution of the high schoolers' weights, and then examine it in more detail by gender and age.
#summary statistics of weights of all high schoolers
yrbss %>%
  dplyr::select(weight) %>%
summarise(min_weight = min(weight, na.rm = TRUE), #set na.rm = TRUE to disregard observations with no weight information
max_weight = max(weight, na.rm = TRUE),
median_weight = median(weight, na.rm = TRUE),
mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
n = n()
)## # A tibble: 1 × 6
## min_weight max_weight median_weight mean_weight sd_weight n
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 29.9 181. 64.4 67.9 16.9 13583
#visualisation of distribution of weights of all high schoolers
ggplot(yrbss, aes(x=weight)) +
geom_histogram() +
labs(title = "Distribution of weights of all high schoolers",
       x = "Weight (kg)",
y = "# of observations") +
  theme_bw()
From the summary statistics as well as the histogram, we can see that the data is right-skewed, meaning that the mean is higher than the median. It therefore seems that we have many outliers at the higher end of the distribution, corresponding to high schoolers who would be considered obese.
Since the “normal” (or healthy) weight differs between boys and girls and also among age groups, we also want to look at the weight distribution in more detail for the different genders and for the different age groups. We do this with the following code.
#first remove observations where gender or age is NA to have a clear summarising
yrbss_upd <- yrbss %>%
drop_na(age, gender)
#summary statistics for weight by age
yrbss_upd %>%
  group_by(age) %>% # there are no instances with weight data but no age data
summarise(min_weight = min(weight, na.rm = TRUE),
max_weight = max(weight, na.rm = TRUE),
median_weight = median(weight, na.rm = TRUE),
mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
count = n()
)## # A tibble: 7 × 7
## age min_weight max_weight median_weight mean_weight sd_weight count
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 12 54.9 90.7 65.8 68.2 12.2 26
## 2 13 35.4 113. 62.6 65.3 21.4 18
## 3 14 31.8 159. 59.0 62.2 14.8 1368
## 4 15 29.9 146. 61.7 65.1 15.8 3097
## 5 16 33.1 163. 64.9 67.9 16.3 3202
## 6 17 38.6 159. 66 69.5 16.8 3471
## 7 18 34.0 181. 68.0 72.5 18.6 2319
#summary statistics for weight by gender
yrbss_upd %>%
  group_by(gender) %>% # there are no instances with weight data but no gender data
summarise(min_weight = min(weight, na.rm = TRUE),
max_weight = max(weight, na.rm = TRUE),
median_weight = median(weight, na.rm = TRUE),
mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
count = n()
)## # A tibble: 2 × 7
## gender min_weight max_weight median_weight mean_weight sd_weight count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 female 29.9 148. 59.0 61.9 14.4 6594
## 2 male 31.8 181. 70.3 73.6 17.1 6907
#visualisation of distribution of weight by age group and gender
ggplot(yrbss_upd,aes(x = weight, fill = gender)) +
geom_density()+
facet_wrap(~ age, scales = "free")+
labs(title = "Distribution of weight by age and gender",
x = "Weight",
       y = "Density") +
theme_bw() +
#theme(legend.position = "none") +
  NULL
While we do not have much data for 12- and 13-year-olds, we can see from the density plots for the other ages that the weight distribution is very similar across ages. The data is always right-skewed with some outliers at the higher end. Comparing genders, we can see that the weight of male high schoolers is more variable and on average also higher than that of females.
Next, we consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions. We do this with the help of density plots.
#create densityplots to inspect relationship between physical activity and weight
yrbss %>%
filter(!is.na(physically_active_7d)) %>%
ggplot(aes(x = weight)) +
geom_density() +
facet_wrap(~physically_active_7d, nrow = 7) +
labs(title = "Distribution of weights per # of physically active days",
       x = "Weight (kg)",
       y = "Density") +
theme_bw() +
theme(axis.text.x = element_blank()) +
  NULL
After a first inspection of the data, we cannot really see a big difference in the weights of high schoolers who are less physically active compared to those who are more active. To test the relationship further, we create a new variable in the dataframe yrbss, called physical_3plus, which is yes if they are physically active on at least 3 days a week, and no otherwise. We also calculate the number and percentage of those who are and are not active on at least 3 days. We first use the count() function and check whether we get the same results as group_by() ... summarise().
#create new variable physical_3plus
yrbss2 <- yrbss %>%
  filter(!is.na(physically_active_7d)) %>% #remove observations with no value for physically_active_7d
mutate(physical_3plus = ifelse(physically_active_7d >= 3, "yes", "no"))
# calculate number of high schoolers in each group with count
yrbss2 %>%
count(physical_3plus) %>%
mutate (perc = n/sum(n))## # A tibble: 2 × 3
## physical_3plus n perc
## <chr> <int> <dbl>
## 1 no 4404 0.331
## 2 yes 8906 0.669
#calculate number and percentage with group_by and summarise
yrbss2 %>%
group_by(physical_3plus) %>%
summarise(n = n()) %>%
mutate(perc = n/sum(n))## # A tibble: 2 × 3
## physical_3plus n perc
## <chr> <int> <dbl>
## 1 no 4404 0.331
## 2 yes 8906 0.669
#look at summary statistics for weight of the two groups
favstats(weight~ physical_3plus, data = yrbss2)## physical_3plus min Q1 median Q3 max mean sd n missing
## 1 no 29.9 54.4 62.6 74.8 181 66.7 17.6 4022 382
## 2 yes 33.1 56.7 65.8 77.1 160 68.4 16.5 8342 564
Next, we provide a 95% confidence interval for the population proportion of high schoolers that are NOT active 3 or more days per week.
yrbss2 %>%
summarise(total = n(),
perc_no = sum(physical_3plus == "no")/total,
SE = sqrt(perc_no * (1-perc_no)/total),
z_critical = qt(0.975, total - 1),
lower = perc_no - z_critical * SE,
upper = perc_no + z_critical * SE
)## # A tibble: 1 × 6
## total perc_no SE z_critical lower upper
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13310 0.331 0.00408 1.96 0.323 0.339
From the output above, we can see that the 95% confidence interval for the proportion of high schoolers who are active on 2 days or fewer per week is [0.323, 0.339].
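As a sanity check, a very similar interval comes out of base R's prop.test(), plugging in the counts from the output above (4,404 "no" answers out of 13,310); prop.test() uses a Wilson score interval with continuity correction, so it agrees with the normal-approximation interval to about three decimals:

```r
# Cross-check the proportion CI with base R's prop.test()
# (counts taken from the summarise() output above)
res <- prop.test(x = 4404, n = 13310, conf.level = 0.95)
res$conf.int  # approximately [0.323, 0.339]
```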
To inspect the relationship between physical_3plus vs. weight, we next create a boxplot diagram.
# Boxplot of weights for those who answered yes or no
yrbss2 %>%
ggplot(aes(x = physical_3plus, y = weight))+
geom_boxplot()+
labs(title = "Relationship between physical activity and weight of high schoolers",
x = "At least 3 physically active days per week",
y = "Weight (kg)") +
theme_bw() +
  NULL
Somewhat surprisingly, those who exercise on at least 3 days a week have a marginally higher median weight than those who exercise on fewer than 3 days. Perhaps this is a result of the composition of the more active group, with age, gender and height as likely influential variables. It might also be that those who exercise more carry more muscle, which weighs more than fat. Unsurprisingly, the highest observed weight is in the group that exercises on fewer than 3 days a week; it is unlikely that someone who exercises regularly, even accounting for height, age and gender, would weigh over 175kg. Accordingly, the "no" group has the largest range, containing both the lowest and the highest weights in the entire dataset. The two groups do, however, have similar interquartile ranges.
Boxplots show how the medians of the two distributions compare, but we can also compare the means of the distributions using either a confidence interval or a hypothesis test. Note that when we calculate the mean, SD, etc. weight in these groups using the mean function, we must ignore any missing values by setting the na.rm = TRUE.
#calculate confidence intervals for the two groups
yrbss2 %>%
group_by(physical_3plus) %>%
summarise (mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE),
count = n(),
SE = sd_weight/sqrt(count),
t_critical = qt(0.975, count - 1),
lower = mean_weight - t_critical * SE,
upper = mean_weight + t_critical * SE)## # A tibble: 2 × 8
## physical_3plus mean_weight sd_weight count SE t_critical lower upper
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 no 66.7 17.6 4404 0.266 1.96 66.2 67.2
## 2 yes 68.4 16.5 8906 0.175 1.96 68.1 68.8
There is an observed difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap, which suggests the difference is statistically significant at the 5% level. Let us also conduct a formal hypothesis test.
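The non-overlap check can also be backed up by a direct 95% confidence interval for the difference in means, built from the rounded favstats() summary above (note that the non-missing weight counts, 8,342 and 4,022, are the relevant sample sizes); a back-of-the-envelope sketch:

```r
# Approximate 95% CI for the difference in mean weight (yes - no),
# using rounded group statistics from the favstats() output above
mean_yes <- 68.44; sd_yes <- 16.5; n_yes <- 8342  # non-missing weights only
mean_no  <- 66.67; sd_no  <- 17.6; n_no  <- 4022

diff_means <- mean_yes - mean_no
# Welch standard error: group variances are not assumed equal
se_diff <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)
ci <- diff_means + c(-1, 1) * 1.96 * se_diff
ci  # approximately [1.12, 2.42]
```

This matches the interval reported by t.test() in the next chunk, up to sign, since t.test() computes no - yes.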
For our hypothesis test, we first write down the null and alternative hypotheses. The null hypothesis assumes that there is no difference in the mean weight between high schoolers who are physically active at least 3 times a week and those who are not. The alternative hypothesis states that there is a difference.
\(Null \space hypothesis \space H_0: \delta = 0\)
\(Alternative \space hypothesis \space H_a: \delta \neq 0\)
t.test(weight ~ physical_3plus, data = yrbss2)##
## Welch Two Sample t-test
##
## data: weight by physical_3plus
## t = -5, df = 7479, p-value = 9e-08
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -2.42 -1.12
## sample estimates:
## mean in group no mean in group yes
## 66.7 68.4
Looking at the output from the t.test() above, we can see that the p-value is very small, and the confidence interval does not contain 0. Therefore, we can reject the null hypothesis based on this output.
Next, we will introduce a new function, hypothesize(), which falls into the infer workflow. We will use this method for conducting hypothesis tests.
But first, we need to initialize the test, which we will save as obs_diff.
obs_diff <- yrbss2 %>%
specify(weight ~ physical_3plus) %>%
  calculate(stat = "diff in means", order = c("yes", "no"))
Notice how we can use the functions specify and calculate again, as we did when calculating confidence intervals. Here, though, the statistic we are looking for is the difference in means, with the order yes - no.
After we have initialized the test, we need to simulate the test on the null distribution, which we will save as null_dist.
set.seed(112)
null_dist <- yrbss2 %>%
# specify variables
specify(weight ~ physical_3plus) %>%
# assume independence, i.e, there is no difference
hypothesize(null = "independence") %>%
# generate 1000 reps, of type "permute"
generate(reps = 1000, type = "permute") %>%
# calculate statistic of difference, namely "diff in means"
  calculate(stat = "diff in means", order = c("yes", "no"))
Here, hypothesize is used to set the null hypothesis as a test for independence, i.e., that there is no difference between the two population means. In one-sample cases, the null argument can instead be set to "point" to test a hypothesis relative to a point estimate.
Also, note that the type argument within generate is set to permute, which is the argument when generating a null distribution for a hypothesis test.
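As a side note, the one-sample "point" null mentioned above looks like this in the infer grammar; the tiny dataset and the hypothesized mean of 67 kg are made up purely for illustration:

```r
library(tibble)
library(infer)

# Toy one-sample example (illustrative data, not from yrbss):
# test whether the mean weight equals a hypothesized 67 kg
toy_weights <- tibble(weight = c(55, 61, 64, 66, 70, 72, 75, 81))

set.seed(1)
null_point <- toy_weights %>%
  specify(response = weight) %>%
  hypothesize(null = "point", mu = 67) %>%      # point null instead of independence
  generate(reps = 1000, type = "bootstrap") %>% # bootstrap, not permute
  calculate(stat = "mean")
```

The resulting null_point tibble holds 1,000 bootstrapped means centred on mu, which can then be passed to visualize() and get_p_value() exactly as in the permutation test above.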
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()
Now that the test is initialized and the null distribution formed, we can visualise how many of these null permutations have a difference of at least the observed obs_stat of 1.77.
We can also calculate the p-value for our hypothesis test using the function infer::get_p_value().
null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")
null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
This is the standard workflow for performing hypothesis tests. We get the same result (rejection of the null hypothesis) as with the t.test() function. Note that a simulated p-value of 0 simply means that none of the 1,000 permutations produced a difference as extreme as the observed one, i.e. p < 1/1000.
Let us recall the IMDB ratings data from past homeworks. This time, we want to explore whether the mean IMDB ratings for Steven Spielberg and Tim Burton are the same or not. Looking at the graph below, where the confidence intervals for the mean ratings of these two directors have been calculated, we can see that they overlap.
As a first step, we want to reproduce this graph.
In addition, we will run a hypothesis test using both the t.test command and the infer package to simulate from a null distribution, where we assume zero difference between the two.
Before anything, we write down the null and alternative hypotheses.
We set up a null hypothesis that assumes that there is no difference in means, i.e. that the difference is 0 and the effect is not real. Our alternative hypothesis is that the difference is not 0. In mathematical terms, this leads to the following two hypotheses:
\(Null \space hypothesis \space H_0: \delta = 0\)
\(Alternative \space hypothesis \space H_a: \delta \neq 0\)
We also choose our significance level at 5%.
Now we can load the data and examine its structure.
movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)## Rows: 2,961
## Columns: 11
## $ title <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
In order to reproduce the graph above, we will first modify the movies dataset.
#prepare dataset
directors <- movies %>%
filter(director %in% c("Tim Burton", "Steven Spielberg")) #only keep observations with Burton and Spielberg as directors
#calculate CI of mean ratings with formula
directors_CI <- directors %>%
group_by(director) %>%
summarise(
mean_rating = mean(rating),
    n = n(),
SE = sd(rating)/sqrt(n),
t_critical = qt(0.975, (n-1)),
lower = mean_rating - t_critical * SE,
upper = mean_rating + t_critical * SE
)
directors_CI## # A tibble: 2 × 7
## director mean_rating n SE t_critical lower upper
## <chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Steven Spielberg 7.57 23 0.145 2.07 7.27 7.87
## 2 Tim Burton 6.93 16 0.187 2.13 6.53 7.33
Now that we have calculated the confidence intervals for the ratings of the two directors, we can plot the graph from above.
ggplot(directors_CI, aes(x = mean_rating,
y = fct_reorder(director, mean_rating),
color = director)) +
#add rectangle first so that appears in background
geom_rect(xmin = 7.27, xmax = 7.33,
ymin = 0, ymax = Inf,
linetype = "blank",
fill = "grey",
alpha = 0.5) +
#add point and bar showing CIs
geom_point(size = 5) +
geom_errorbarh(aes(xmin = lower, #add bars to visualise CI
xmax = upper),
size = 2,
height = 0.1) +
labs(title = "Do Spielberg and Burton have the same mean IMDB ratings?",
subtitle = "95% confidence intervals overlap",
       x = "Mean IMDB Rating",
y = NULL,
color = NULL) +
#add labels to CIs
annotate("text", x = 7.57, y = 2.13, label = "7.57", size = 8) +
annotate("text", x = 7.27, y = 2.12, label = "7.27", size = 5) +
annotate("text", x = 7.87, y = 2.12, label = "7.87", size = 5) +
annotate("text", x = 6.93, y = 1.13, label = "6.93", size = 8) +
annotate("text", x = 6.53, y = 1.12, label = "6.53", size = 5) +
annotate("text", x = 7.33, y = 1.12, label = "7.33", size = 5) +
theme_bw() +
theme(plot.title = element_text(size = 13, face = "bold"),
text = element_text(size = 11),
legend.position = "none") +
  NULL
Since the confidence intervals shown above overlap, we will run a hypothesis test to check whether there is a significant difference in means. We will do this once with the t.test() function and once with the help of the infer package.
#hypothesis test using t.test()
t.test(rating ~ director, data = directors)##
## Welch Two Sample t-test
##
## data: rating by director
## t = 3, df = 31, p-value = 0.01
## alternative hypothesis: true difference in means between group Steven Spielberg and group Tim Burton is not equal to 0
## 95 percent confidence interval:
## 0.16 1.13
## sample estimates:
## mean in group Steven Spielberg mean in group Tim Burton
## 7.57 6.93
#hypothesis test using simulation
set.seed(113)
#calculate observed difference in means
observed_difference <- directors %>%
specify(rating ~ director) %>%
calculate(stat = "diff in means", order = c("Steven Spielberg", "Tim Burton"))
#simulate data under a scenario where difference is 0
ratings_comp <- directors %>%
specify(rating ~ director) %>%
hypothesise(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means",
order = c("Steven Spielberg", "Tim Burton"))
#visualise distribution of null hypothesis world and show observed difference as red line
ratings_comp %>% visualize() +
  shade_p_value(obs_stat = observed_difference, direction = "two-sided")
#calculate p-value
ratings_comp %>%
get_pvalue(obs_stat = observed_difference,
direction = "both")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.002
Resulting test statistic, p-value, and conclusion
Looking at the results of t.test() and the simulation exercise, we reject the null hypothesis. t.test() gives a t-statistic of 3 and a p-value of 0.01, well below our pre-set significance level. The resulting confidence interval of [0.16, 1.13] from t.test() also shows that 0 does not lie in this interval. This result is confirmed by the hypothesis test done with the help of the infer package, which yields a p-value of 0.002, also well below our significance level of 0.05. We therefore conclude that the movies of Steven Spielberg and Tim Burton do not have the same mean IMDB rating, with Steven Spielberg achieving higher ratings.
At the last board meeting of Omega Group Plc., the headquarters of a large multinational company, the issue was raised that women were being discriminated against in the company, in the sense that salaries were not the same for male and female executives. A quick analysis of a sample of 50 employees (24 men and 26 women) revealed that the average salary for men was about 8,700 higher than for women. This seemed like a considerable difference, so it was decided that a further analysis of the company's salaries was warranted.
The objective of this analysis is to find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.
omega <- read_csv(here::here("data", "omega.csv"))
unique(omega$gender)## [1] "male" "female"
#relevel gender variable to calculate difference and correlation later
#omega <- omega %>%
# mutate(gender = factor(gender,
# levels = c("male","female"),
# labels = c("male","female"),
# ordered = TRUE))
glimpse(omega) # examine the data frame## Rows: 50
## Columns: 3
## $ salary <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…
The data frame omega contains the salaries for the sample of 50 executives in the company. We want to answer the question whether there is a significant difference between the salaries of the male and female executives.
We can perform different types of analyses, and check whether they all lead to the same conclusion
. Confidence intervals . Hypothesis testing . Correlation analysis . Regression
As a first step, we calculate summary statistics on salary by gender. Also, we create and print a dataframe where, for each gender, we show the mean, SD, sample size, the t-critical, the SE, the margin of error, and the low/high endpoints of a 95% confidence interval
# Summary Statistics of salary by gender
mosaic::favstats (salary ~ gender, data=omega)## gender min Q1 median Q3 max mean sd n missing
## 1 female 47033 60338 64618 70033 78800 64543 7567 26 0
## 2 male 54768 68331 74675 78568 84576 73239 7463 24 0
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size,
# the t-critical value, the standard error, the margin of error,
# and the low/high endpoints of a 95% confidence interval
salary_by_gender <- omega %>%
group_by(gender) %>%
summarise(mean_salary = mean(salary),
sd_salary = sd(salary),
n = n(),
t_crit = qt(0.975, n-1),
SE = sd_salary/sqrt(n),
margin_of_error = t_crit * SE,
lower = mean_salary - margin_of_error,
upper = mean_salary + margin_of_error
)
salary_by_gender## # A tibble: 2 × 9
## gender mean_salary sd_salary n t_crit SE margin_of_error lower upper
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 female 64543. 7567. 26 2.06 1484. 3056. 61486. 67599.
## 2 male 73239. 7463. 24 2.07 1523. 3151. 70088. 76390.
Conclusion from above analysis
The data analysis shows that there is indeed a difference between the salaries of men and women: the means differ by about 8,700 and the medians by about 10,000, even though the dispersion of the two groups is very similar, as shown by their standard deviations. Most interestingly, the 95% confidence intervals do not overlap at all, so we can already reject, at the 5% level, the null hypothesis that there is no difference in the mean salaries of men and women.
We can also run a hypothesis test, assuming as the null hypothesis that the mean difference in salaries is zero, or that, on average, men and women make the same amount of money. We run our hypothesis test using t.test() and with the simulation method from the infer package.
# hypothesis testing using t.test()
t.test(salary ~ gender, data=omega)##
## Welch Two Sample t-test
##
## data: salary by gender
## t = -4, df = 48, p-value = 2e-04
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -12973 -4420
## sample estimates:
## mean in group female mean in group male
## 64543 73239
# hypothesis testing using infer package
test_diff <- omega %>%
specify(salary ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps =1000, type ="permute") %>%
calculate(stat = "diff in means",
order = c("male", "female"))
#calculate difference in means to calculate mean value
salary_diff <- omega %>%
specify(salary ~ gender) %>%
calculate(stat = "diff in means", order = c("male", "female"))
test_diff %>%
get_pvalue(obs_stat = salary_diff,
direction = "both")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
As a next step, we also calculate the correlation coefficient between gender and salary and perform a linear regression.
#calculate correlation coefficient between gender and salary
# use function biserial.cor from ltm package
library(ltm)
biserial.cor(omega$salary, omega$gender, level = 2)## [1] 0.508
#alternative method - assign number to gender and calculate coefficient with cor() function
omega %>%
mutate(genderN = ifelse(gender == "male", 1, 0)) %>%
dplyr::select(genderN, salary) %>%
cor()## genderN salary
## genderN 1.000 0.508
## salary 0.508 1.000
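The two approaches agree because the point-biserial correlation is simply the Pearson correlation with the binary variable coded as 0/1; a small self-contained demonstration on made-up numbers:

```r
# Point-biserial correlation = Pearson correlation on a 0/1 dummy
# (synthetic data for illustration)
grp <- c("male", "male", "male", "female", "female", "female")
val <- c(70, 75, 80, 60, 62, 64)

dummy <- ifelse(grp == "male", 1, 0)
r <- cor(dummy, val)
r  # about 0.90 for this toy data
```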
#run linear regression with salary as dependent variable and gender as independent variable
salary_reg <- glm(salary ~ gender, data = omega)
summary(salary_reg) #significant result##
## Call:
## glm(formula = salary ~ gender, data = omega)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -18471 -4780 127 5484 14257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64543 1474 43.78 < 2e-16 ***
## gendermale 8696 2128 4.09 0.00017 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 56509713)
##
## Null deviance: 3656271994 on 49 degrees of freedom
## Residual deviance: 2712466235 on 48 degrees of freedom
## AIC: 1038
##
## Number of Fisher Scoring iterations: 2
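As an aside, because glm() defaults to the gaussian family, it fits exactly the same coefficients as lm() would here; a quick check on made-up data:

```r
# glm() with the default gaussian family is equivalent to lm()
# (synthetic data for illustration)
df <- data.frame(
  salary = c(60, 62, 64, 70, 75, 80),
  gender = c("female", "female", "female", "male", "male", "male")
)

fit_glm <- glm(salary ~ gender, data = df)
fit_lm  <- lm(salary ~ gender, data = df)

# both give intercept 62 (the female mean) and gendermale 13 (the difference)
coef(fit_glm)
```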
Conclusion from the analysis above
The correlation between gender and salary was computed by assigning the value 1 to the male gender. It comes out at 0.508, indicating a moderate positive correlation between salary and (male) gender. Both hypothesis tests also show a significant difference between the salaries of men and women, as the simulated p-value is essentially 0; the null hypothesis, which states that the mean difference between the salaries of men and women is 0, can therefore be rejected.
At the board meeting, someone raised the issue that there was indeed a substantial difference between male and female salaries, but that this was attributable to other reasons, such as differences in experience. A questionnaire sent out to the 50 executives in the sample reveals that the average experience of the men is approximately 21 years, whereas the women have only about 7 years of experience on average (see table below).
# Summary Statistics of experience by gender
favstats (experience ~ gender, data=omega)## gender min Q1 median Q3 max mean sd n missing
## 1 female 0 0.25 3.0 14.0 29 7.38 8.51 26 0
## 2 male 1 15.75 19.5 31.2 44 21.12 10.92 24 0
Based on the data above, we can conclude that there is a huge difference in the experience of male and female executives. The mean and median for male executives are 21.1 and 19.5 years respectively, while for female executives they are 7.4 and 3.0 years. The difference in means between the two groups is 13.7 years and the difference in medians is 16.5 years. While the means and medians differ greatly, the standard deviations are fairly similar, meaning the spread within the two groups is comparable.
We now perform similar analyses as in the previous section and test whether these analyses validate or endanger our conclusion about the difference in male and female salaries.
# Dataframe with two rows (male-female) and having as columns gender, mean, SD, sample size,
# the t-critical value, the standard error, the margin of error,
# and the low/high endpoints of a 95% confidence interval
experience_by_gender <- omega %>%
group_by(gender) %>%
summarise(mean_experience = mean(experience),
sd_experience = sd(experience),
n = n(),
t_crit = qt(0.975, n-1),
SE = sd_experience/sqrt(n),
margin_of_error = t_crit * SE,
lower = mean_experience - margin_of_error,
upper = mean_experience + margin_of_error
)
experience_by_gender## # A tibble: 2 × 9
## gender mean_experience sd_experience n t_crit SE margin_of_error lower
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 female 7.38 8.51 26 2.06 1.67 3.44 3.95
## 2 male 21.1 10.9 24 2.07 2.23 4.61 16.5
## # … with 1 more variable: upper <dbl>
Let us now also run t.test() and a simulation with infer.
# hypothesis testing using t.test()
t.test(experience ~ gender, data=omega)##
## Welch Two Sample t-test
##
## data: experience by gender
## t = -5, df = 43, p-value = 1e-05
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -19.35 -8.13
## sample estimates:
## mean in group female mean in group male
## 7.38 21.12
# hypothesis testing using infer package
diff_experience <- omega %>%
specify(experience ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps =1000, type ="permute") %>%
calculate(stat = "diff in means",
        order = c("male", "female"))
#calculate difference in means to calculate mean value
experience_diff <- omega %>%
specify(experience ~ gender) %>%
calculate(stat = "diff in means", order = c("male", "female"))
diff_experience %>%
get_p_value(obs_stat = experience_diff,
direction = "both")## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
#calculate correlation coefficient between gender and experience
# use function biserial.cor from ltm package
biserial.cor(omega$experience, omega$gender, level = 2)## [1] 0.584
#alternative method - assign number to gender and calculate coefficient with cor() function
omega %>%
mutate(genderN = ifelse(gender == "male", 1, 0)) %>%
dplyr::select(genderN, experience) %>%
cor()## genderN experience
## genderN 1.000 0.584
## experience 0.584 1.000
#run linear regression with experience as dependent variable and gender as independent variable
experience_reg <- glm(experience ~ gender, data = omega)
summary(experience_reg) #significant result##
## Call:
## glm(formula = experience ~ gender, data = omega)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -20.12 -6.32 -2.25 8.37 22.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.38 1.91 3.87 0.00033 ***
## gendermale 13.74 2.76 4.98 8.5e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 94.8)
##
## Null deviance: 6909.0 on 49 degrees of freedom
## Residual deviance: 4552.8 on 48 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 2
These results endanger our conclusion that the difference in salaries is due only to the gender of the employees. As can be seen from the two hypothesis tests as well as the correlation coefficient and the linear regression (which shows that male employees have significantly more experience), there is a difference between male and female experience. The permutation test of experience against gender returned a simulated p-value of 0, meaning none of the 1,000 permutations produced a difference as extreme as the observed one (i.e., p < 0.001), and the Welch t-test returned a p-value of about 1e-05, so the null hypothesis can be rejected. This indicates that experience may also drive the difference in salaries, and thus the salary gap between men and women is not due to gender alone.
Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.
We now analyse the relationship between salary and experience. First, we draw a scatterplot to visually inspect the data.
#draw scatterplot to show relationship between salary and experience
ggplot(omega, aes(x=experience, y = salary))+
geom_point() +
geom_smooth(se=FALSE) +
theme_bw() +
labs(title = "Relationship between experience and salary",
y = "Salary",
x = "Experience (years)") +
NULL
Now that we see a positive relationship between salary and experience in the scatterplot, we can also calculate the correlation coefficient.
omega %>%
dplyr::select(salary, experience) %>%
cor()## salary experience
## salary 1.000 0.803
## experience 0.803 1.000
The scatterplot between salary and experience shows a positive correlation between the two variables. This is supported by the correlation coefficient of 0.803, meaning the two variables are strongly positively correlated. While correlation does not imply causation, this is still a strong sign that salaries are based on the experience of the employees.
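For completeness, cor.test() can attach a confidence interval and p-value to this correlation, rather than just the point estimate (a sketch, assuming the omega dataframe used throughout this section):

```r
# Sketch (assumes the omega dataframe from above): cor.test() reports the
# Pearson correlation together with a 95% CI and a p-value
cor.test(omega$salary, omega$experience)
```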
We can use GGally::ggpairs() to create a scatterplot and correlation matrix. Essentially, we change the order our variables will appear in and have the dependent variable (Y), salary, last in our list. We then pipe the dataframe to ggpairs() with aes arguments to colour by gender and make the plots somewhat transparent (alpha = 0.3).
omega %>%
dplyr::select(gender, experience, salary) %>% #order variables they will appear in ggpairs()
ggpairs(aes(colour=gender, alpha = 0.3))+
theme_bw()Conclusion from salary vs experience scatterplot
The salary vs experience scatterplot suggests a positive relationship between the two variables: the higher the salary, the higher you can expect the experience to be, and vice versa. The scatterplot also colours the dots by gender, and we can clearly see that the blue dots representing male employees sit towards the upper right, whereas the red dots representing female employees are concentrated in the bottom-left corner. This shows that male employees have more experience than female employees and are also generally paid more. However, looking at the red dot furthest to the right (around 30 years of experience) and comparing it to a similar blue dot, the salary of this employee seems rather low compared to that of male employees with a similar level of experience. The same holds true for employees with between 10 and 20 years of experience, where the highest salaries are all paid to men. Therefore, there might still be a certain level of gender-based discrimination. However, more data would be needed to verify this statement.
Every so often, we hear warnings from commentators about the “inverted yield curve” and its predictive power with respect to recessions. An explainer of what an inverted yield curve is can be found here, or in a great podcast from NPR on yield curve indicators. A very nice article that explains the yield curve and its inversion can be found here.
In addition, many articles and commentators view a yield curve inversion as a harbinger of recession. One can always doubt whether inversions truly are a harbinger of recessions, and consider the attached parable on yield curve inversions.
In order to better understand yield curve inversion, we will look at US data and use the FRED database to download historical yield curve rates, and plot the yield curves since 1999 to see when the yield curves flatten and invert. At the end of this challenge we will produce this chart:
First, we will load the yield curve data file that contains data on the yield curve since 1960-01-01
yield_curve <- read_csv(here::here("data", "yield_curve.csv"))
glimpse(yield_curve)## Rows: 6,884
## Columns: 5
## $ date <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01,…
## $ series_id <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS…
## $ value <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, …
## $ maturity <chr> "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", "3m", …
## $ duration <chr> "3-Month Treasury Bill", "3-Month Treasury Bill", "3-Month T…
Our dataframe yield_curve has five columns (variables):
date: already a date object
series_id: the FRED database ticker symbol
value: the actual yield on that date
maturity: a shorthand for the maturity of the bond
duration: the duration, written out in all its glory!
We will produce three plots to see what yield curves have looked like since 1960. For each, we will first look at the sample plot and then try to produce our own version.
Sample plot:
Produce our own plot:
#use factor() to convert duration into a factor (with order)
yield_curve <- yield_curve %>%
mutate(duration = factor(duration,
levels = c("3-Month Treasury Bill",
"6-Month Treasury Bill",
"1-Year Treasury Rate",
"2-Year Treasury Rate",
"3-Year Treasury Rate",
"5-Year Treasury Rate",
"7-Year Treasury Rate",
"10-Year Treasury Rate",
"20-Year Treasury Rate",
"30-Year Treasury Rate")))
#create the line plot
ggplot(yield_curve,
aes(x = date,
y = value,
color = duration)) +
geom_line() +
facet_wrap(~duration,
nrow = 5) +
labs(x = NULL,
y = "%",
title = "Yields on U.S. Treasury rates since 1960",
caption = "Source: St. Louis Federal Reserve Economic Database (FRED)") +
theme_bw() +
theme(text = element_text(size = 11),
legend.position = "none") +
NULL
Sample plot:
Produce our own plot:
Before we do the plotting, we need to add two new columns to explicitly record the year and month of every date.
#use factor() to convert maturity into a factor (with order)
#label the year and month of every sample point
yield_curve <- yield_curve %>%
mutate(maturity = factor(maturity,
levels = c("3m", "6m", "1y", "2y", "3y", "5y", "7y", "10y", "20y", "30y")),
year = year(date),
month = month(date, label = TRUE))
glimpse(yield_curve)## Rows: 6,884
## Columns: 7
## $ date <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-05-01,…
## $ series_id <chr> "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS", "TB3MS…
## $ value <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, 2.30, …
## $ maturity <fct> 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, 3m, …
## $ duration <fct> 3-Month Treasury Bill, 3-Month Treasury Bill, 3-Month Treasu…
## $ year <dbl> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, …
## $ month <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, …
Now we can proceed with plotting.
#create the line plot
#use factor() to prevent the color pool from becoming continuous
yield_curve %>%
filter(year >= 1999) %>%
ggplot(aes(x = maturity,
y = value,
color = factor(year))) +
#use group argument to group data by month in each facet
geom_line(aes(group = month)) +
facet_wrap(~year,
nrow = 6) +
labs(x = "Maturity",
y = "Yield(%)",
title = "US Yield Curve",
caption = "Source: St. Lous Federal Reserve Economic Database (FRED)") +
theme_bw() +
theme(text = element_text(size = 11),
legend.position = "none") +
NULL
Sample plot:
Produce our own plot:
yield_curve %>%
#filter by multiple conditions
filter(maturity == "3m" | maturity == "10y",
year >= 1999) %>%
#create line plot
ggplot(aes(x = date,
y = value,
color = duration)) +
geom_line() +
labs(x = NULL,
y = "%",
title = "Yields on 3-month and 10-year US Treasury rates since 1999",
caption = "Source: St. Lous Federal Reserve Economic Database (FRED)") +
theme_bw() +
theme(text = element_text(size = 11),
legend.title = element_blank()) +
NULL
First, let’s use the graph above to see how yield curve inversion has predicted recessions since 1999.
According to Wikipedia’s list of recessions in the United States, since 1999 there have been two recessions in the US: Mar 2001–Nov 2001 and Dec 2007–June 2009. Indeed, we see from the second graph that the yield curve flattened in 2000 and in 2006–2007, which preceded the two recessions mentioned. Additionally, the short-term (3-month) yield briefly exceeded the longer-term (10-year) yield in late 2000 and late 2006, both times signalling an upcoming recession.
To generalize this phenomenon, we need to expand our time horizon and make a more informative graph indicating when the recessions arrived.
The code below creates a dataframe with all US recessions since 1946.
# get US recession dates after 1946 from Wikipedia
# https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States
recessions <- tibble(
from = c("1948-11-01", "1953-07-01", "1957-08-01", "1960-04-01", "1969-12-01", "1973-11-01", "1980-01-01","1981-07-01", "1990-07-01", "2001-03-01", "2007-12-01","2020-02-01"),
to = c("1949-10-01", "1954-05-01", "1958-04-01", "1961-02-01", "1970-11-01", "1975-03-01", "1980-07-01", "1982-11-01", "1991-03-01", "2001-11-01", "2009-06-01", "2020-04-30")
) %>%
mutate(from = ymd(from),
to = ymd(to),
duration_days = to - from)
head(recessions)## # A tibble: 6 × 3
## from to duration_days
## <date> <date> <drtn>
## 1 1948-11-01 1949-10-01 334 days
## 2 1953-07-01 1954-05-01 304 days
## 3 1957-08-01 1958-04-01 243 days
## 4 1960-04-01 1961-02-01 306 days
## 5 1969-12-01 1970-11-01 335 days
## 6 1973-11-01 1975-03-01 485 days
We will first calculate the spread between the 10-year and 3-month Treasury yields and then separately record the positive and negative spreads in two new columns.
yield_curve_spread <- yield_curve %>%
#select only relevant data
dplyr::select(date, value, maturity) %>%
#choose only 3-month and 10-year yields
filter(maturity == "3m" | maturity == "10y") %>%
#pivot the table to wider form with maturity as new names
pivot_wider(names_from = maturity,
values_from = value) %>%
#calculate yield spread, need to use ` to recognize column names with number
mutate(spread = `10y` - `3m`,
positive_spread = pmax(spread, 0),
negative_spread = pmin(spread, 0))
glimpse(yield_curve_spread)## Rows: 740
## Columns: 6
## $ date <date> 1960-01-01, 1960-02-01, 1960-03-01, 1960-04-01, 1960-…
## $ `3m` <dbl> 4.35, 3.96, 3.31, 3.23, 3.29, 2.46, 2.30, 2.30, 2.48, …
## $ `10y` <dbl> 4.72, 4.49, 4.25, 4.28, 4.35, 4.15, 3.90, 3.80, 3.80, …
## $ spread <dbl> 0.37, 0.53, 0.94, 1.05, 1.06, 1.69, 1.60, 1.50, 1.32, …
## $ positive_spread <dbl> 0.37, 0.53, 0.94, 1.05, 1.06, 1.69, 1.60, 1.50, 1.32, …
## $ negative_spread <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
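Before plotting, we can already pull the inversion episodes directly out of this dataframe (a sketch using the yield_curve_spread constructed above):

```r
# List the months in which the curve was inverted, i.e. the 3-month yield
# exceeded the 10-year yield (spread < 0)
yield_curve_spread %>%
  filter(spread < 0) %>%
  dplyr::select(date, `3m`, `10y`, spread)
```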
Now we can reproduce the graph:
ggplot(yield_curve_spread,
aes(x = date)) +
#add line plot for yield spreads and a horizontal line for zero spread
geom_line(aes(y = spread)) +
geom_hline(yintercept = 0,
color = "black",
size = 0.6) +
#use geom_ribbon to see whether spread is positive or negative
geom_ribbon(aes(ymin = 0, ymax = positive_spread),
fill = "dodgerblue3",
alpha = 0.3) +
geom_ribbon(aes(ymin = negative_spread, ymax = 0),
fill = "red3",
alpha = 0.3) +
#use geom_rect to add the grey shaded areas corresponding to recessions
#geom_rect doesn't produce the desired effect in terms of alpha, so we use annotate(geom = "rect", ...) to optimize the result
ggplot2::annotate(geom = "rect",
xmin = recessions$from, xmax = recessions$to,
ymin = -Inf, ymax = Inf,
fill = "black",
alpha = 0.2) +
#use geom_rug for the small indicators at the bottom, with two differently sliced dataset from yield_curve_spread
geom_rug(data = yield_curve_spread[yield_curve_spread[ , "positive_spread"] != 0, , drop = FALSE],
color = "dodgerblue3",
alpha = 0.5,
sides = "b") +
geom_rug(data = yield_curve_spread[yield_curve_spread[ , "negative_spread"] != 0, , drop = FALSE],
color = "red3",
alpha = 0.5,
sides = "b") +
#set the format of x-axis
scale_x_date(date_labels = "%Y",
limits = c(date("1959-01-01"), date("2023-01-01")),
expand = c(0.03, 0.03),
date_breaks = "2 years",
date_minor_breaks = "1 year") +
#some other formatting
labs(x = NULL,
y = "Difference (10 year - 3 month) yield in %",
title = "Yield Curve Inversion: 10-year minus 3-month U.S. Treasury rates",
subtitle = "Difference in % points, monthly averages\nShaded areas correspond to recessions",
caption = "Source: FRED, Federal Reserve Bank at St. Louis") +
theme_bw() +
theme(text = element_text(size = 9),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(face = "italic"),
panel.border = element_blank(),
panel.grid = element_line(color = "grey95"),
axis.ticks = element_blank()) +
NULL
Our answer to whether yield curve inversion can predict recession is: roughly yes. Of the 8 yield curve inversions since 1960, 7 were closely followed by an economic recession in the US, and of the 8 recessions since 1960, 7 were preceded by a yield curve inversion; the remaining one (Jul 1990 - Mar 1991) was preceded by a deep flattening in which the spread between the 10-year and 3-month US Treasury rates almost hit zero.
One explanation for this relationship is the expectations hypothesis: long-term rates reflect the expected short-term rates of the future. Under this hypothesis, the yield on the 10-year Treasury bond can be interpreted as the market’s expectation of future short-term rates. When the market expects a rate cut from the Federal Reserve in the future to boost a weak economy, this expectation is reflected in the declining long-term yield. A self-fulfilling prophecy then pulls the long-term yield down further, eventually to a level below the short-term yield.
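As a toy illustration of the expectations hypothesis (the numbers below are hypothetical, not taken from the FRED data): if the long-term rate is roughly the average of expected future short-term rates, an expected path of rate cuts pushes the long rate below the current short rate.

```r
# Hypothetical path of expected 1-year rates over the next 5 years, in %
expected_short_rates <- c(4.0, 3.0, 2.5, 2.0, 2.0)
# Under the expectations hypothesis, the 5-year rate is roughly their average
long_rate <- mean(expected_short_rates)
long_rate  # 2.7, below the current short rate of 4.0, i.e. an inverted curve
```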
Another explanation is that, before a recession actually hits the economy, investors who foresee the downside risk rush to buy long-term bonds as a safe-haven asset to secure their future cash flows, pushing up the price of long-term bonds and reducing their yields. As more and more people realize how bad the economic outlook is, the long-term yield falls further until a yield curve inversion occurs.
In conclusion, if a yield curve inversion happens today, it is highly likely that a recession is on the way. We cannot establish causality at this point, but the repeated sequence of historical yield curve inversions followed by recessions convinces us that the two are correlated.
At the risk of oversimplifying things, the main components of gross domestic product (GDP) are personal consumption (C), business investment (I), government spending (G) and net exports (exports minus imports). One can read more about GDP and the different approaches to calculating it on the Wikipedia GDP page.
The GDP data we will look at is from the United Nations’ National Accounts Main Aggregates Database, which contains estimates of total GDP and its components for all countries from 1970 to today. We will look at how GDP and its components have changed over time, and compare different countries and how much each component contributes to that country’s GDP. The file we will work with is GDP and its breakdown at constant 2010 prices in US Dollars and it has already been saved in the Data directory.
UN_GDP_data <- read_excel(here::here("data", "Download-GDPconstant-USD-countries.xls"), # Excel filename
sheet="Download-GDPconstant-USD-countr", # Sheet name
skip=2) # Number of rows to skip
The first thing we need to do is tidy the data: it is in wide format and we must reshape it into long, tidy format. We also express all figures in billions, and we want to rename the indicators we are interested in to something shorter.
tidy_GDP_data <- UN_GDP_data %>%
#pivot data to long format
pivot_longer(cols = 4:51,
names_to = "Year",
values_to = "Value") %>%
#Express all figures in billions
mutate(Value = Value/1e9)
#look at dataset
glimpse(tidy_GDP_data)## Rows: 176,880
## Columns: 5
## $ CountryID <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ Country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanista…
## $ IndicatorName <chr> "Final consumption expenditure", "Final consumption expe…
## $ Year <chr> "1970", "1971", "1972", "1973", "1974", "1975", "1976", …
## $ Value <dbl> 5.56, 5.33, 5.20, 5.75, 6.15, 6.32, 6.37, 6.90, 7.09, 6.…
#check names of different components
unique(tidy_GDP_data$IndicatorName)## [1] "Final consumption expenditure"
## [2] "Household consumption expenditure (including Non-profit institutions serving households)"
## [3] "General government final consumption expenditure"
## [4] "Gross capital formation"
## [5] "Gross fixed capital formation (including Acquisitions less disposals of valuables)"
## [6] "Exports of goods and services"
## [7] "Imports of goods and services"
## [8] "Gross Domestic Product (GDP)"
## [9] "Agriculture, hunting, forestry, fishing (ISIC A-B)"
## [10] "Mining, Manufacturing, Utilities (ISIC C-E)"
## [11] "Manufacturing (ISIC D)"
## [12] "Construction (ISIC F)"
## [13] "Wholesale, retail trade, restaurants and hotels (ISIC G-H)"
## [14] "Transport, storage and communication (ISIC I)"
## [15] "Other Activities (ISIC J-P)"
## [16] "Total Value Added"
## [17] "Changes in inventories"
We further clean the dataset so that we can produce this plot.
#prepare data for plotting
#create vector that includes the components we want to plot
components <- c("Gross capital formation", "Exports of goods and services", "General government final consumption expenditure",
"Household consumption expenditure (including Non-profit institutions serving households)",
"Imports of goods and services", "Gross Domestic Product (GDP)")
#create vector to order IndicatorName later
ordered_components <- c("Gross capital formation", "Exports", "Government expenditure", "Household expenditure", "Imports")
# Let us compare GDP components for these 3 countries
country_list <- c("United States","India", "Germany")
tidy_GDP_data2 <- tidy_GDP_data %>%
#select only those observations that show data for countries in country list and the 5 components we are interest in
filter(Country %in% country_list & IndicatorName %in% components) %>%
#change names of components
mutate(IndicatorName =
ifelse(str_detect(IndicatorName, "Exports|Imports"),
str_sub(IndicatorName,1,7),
ifelse(str_detect(IndicatorName, "government"),
"Government expenditure",
ifelse(str_detect(IndicatorName, "Household"),
"Household expenditure",
ifelse(str_detect(IndicatorName, "Domestic"),
"GrossDomesticProduct",
IndicatorName)
))),
#IndicatorName = factor(IndicatorName, levels = ordered_components),
Year = as.double(Year)
)
#check new dataset
glimpse(tidy_GDP_data2)## Rows: 864
## Columns: 5
## $ CountryID <dbl> 276, 276, 276, 276, 276, 276, 276, 276, 276, 276, 276, 2…
## $ Country <chr> "Germany", "Germany", "Germany", "Germany", "Germany", "…
## $ IndicatorName <chr> "Household expenditure", "Household expenditure", "House…
## $ Year <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 19…
## $ Value <dbl> 872, 919, 969, 997, 995, 1032, 1076, 1122, 1162, 1200, 1…
#plot the data
tidy_GDP_data2 %>%
filter(IndicatorName != "GrossDomesticProduct") %>%
mutate(IndicatorName = factor(IndicatorName, levels = ordered_components)) %>%
ggplot(aes(x = Year, y = Value, color = IndicatorName)) + #color by component
geom_line(size = 0.8) +
facet_wrap(~Country) + #plot by country
theme_bw() + #set theme to black and white
labs(title = "GDP components over time",
subtitle = "In constant 2010 USD",
x = NULL,
y = "Billion US$",
color = "Components of GDP") +
theme(plot.title = element_text(size = 14, face = "bold")) + #change formatting of title
scale_x_continuous(breaks = seq(1970, 2017, by = 10), #set grid
minor_breaks = seq(1970, 2017, by = 10)) +
NULL
Secondly, recall that GDP is the sum of Household Expenditure (consumption, C), Gross Capital Formation (business investment, I), Government Expenditure (G) and Net Exports (exports minus imports). Even though there is an indicator Gross Domestic Product (GDP) in our dataframe, we would like to calculate it from the components discussed above.
#calculate GDP from components
pivot_GDP_data <- tidy_GDP_data2 %>%
#pivot dataframe to wider format
pivot_wider(names_from = IndicatorName, values_from = Value) %>%
#calculate GDP from components and difference to given value
mutate(`Net Exports` = Exports - Imports,
GDP = `Household expenditure` + `Government expenditure` + `Gross capital formation` + `Net Exports`,
delta_GDP = (GDP - GrossDomesticProduct)/GrossDomesticProduct,
#abs_delta_GDP = abs((GDP - GrossDomesticProduct)/GrossDomesticProduct)
) #percentage difference between calculated and given GDP
favstats(~delta_GDP, data = pivot_GDP_data)## min Q1 median Q3 max mean sd n missing
## -0.0634 3.48e-05 0.00732 0.021 0.0741 0.00867 0.0226 144 0
Below, we plot the distribution of the % difference between what we calculated as GDP and the GDP figure included in the dataframe.
ggplot(pivot_GDP_data, aes(x=delta_GDP)) +
geom_histogram() +
scale_x_continuous(label = scales::percent_format(accuracy = 0.01)) +
labs(title = "Distribution of differences in GDP values",
x = "Difference between calculated and given GDP values as %age of given GDP value",
y = NULL) +
theme_bw() +
NULL
The summary statistics and histogram above show the percentage differences between the GDP figure we calculated from the components and the one given in the dataset. Looking at the mean and median, the typical difference is less than 1%, even though some observations differ considerably more, as can be seen from the minimum of -6.34% and the maximum of 7.41%. The data is slightly right-skewed, as the mean is higher than the median. Even though the mode is very close to 0, the variability is relatively high, with a standard deviation of 0.0226. We can therefore conclude that not all of the given GDP values in the dataframe are consistent with the definition of GDP.
As a next step, we want to reproduce the following graph that expresses the GDP components in percentages.
#modify dataframe in order to express components as %age of GDP
tidy_GDP_data3 <- pivot_GDP_data %>%
#change columns to express components as %age of GDP
mutate(`Government expenditure` = `Government expenditure`/GDP,
`Household expenditure` = `Household expenditure`/GDP,
`Gross capital formation` = `Gross capital formation`/GDP,
`Net Exports` = `Net Exports`/GDP) %>%
#select only necessary columns
dplyr::select(CountryID, Country, Year, `Household expenditure`, `Government expenditure`,
`Gross capital formation`, `Net Exports`) %>%
#pivot back to longer format
pivot_longer(cols = 4:7,
names_to = "IndicatorName",
values_to = "Value")
#look at resulting dataframe
glimpse(tidy_GDP_data3)## Rows: 576
## Columns: 5
## $ CountryID <dbl> 276, 276, 276, 276, 276, 276, 276, 276, 276, 276, 276, 2…
## $ Country <chr> "Germany", "Germany", "Germany", "Germany", "Germany", "…
## $ Year <dbl> 1970, 1970, 1970, 1970, 1971, 1971, 1971, 1971, 1972, 19…
## $ IndicatorName <chr> "Household expenditure", "Government expenditure", "Gros…
## $ Value <dbl> 0.55152, 0.17732, 0.27973, -0.00857, 0.56131, 0.18200, 0…
#plot data
ggplot(tidy_GDP_data3, aes(x=Year, y = Value, color = IndicatorName)) + #color by component
geom_line(size = 0.8) +
facet_wrap(~Country) + #plot by country
theme_bw() + #set theme to black and white
labs(title = "GDP and its breakdown at constant 2010 prices in US Dollars",
x = NULL,
y = "proportion",
color = NULL,
caption = "Source: United Nations, https://unstats.un.org/unsd/snaama/Downloads") +
theme(plot.title = element_text(size = 14, face = "bold")) + #change formatting of title
scale_x_continuous(breaks = seq(1970, 2017, by = 10), #set grid
minor_breaks = seq(1970, 2017, by = 10)) +
scale_y_continuous(label = scales::percent_format(accuracy = 0.1)) +
NULL
Explanation of what the chart is telling us
The chart above shows the proportion of the GDP components of total GDP for the countries Germany, India and the United States over time. While we can see that the ranking of these components is fairly similar for the three countries, i.e. household expenditure contributes most to GDP and net exports the least, the dynamics of the components for the individual countries are different.
Let us look at Germany, where the proportions of government expenditure and household expenditure remained fairly stable between 1970 and 2017, at around 20% and between 55% and 60% of GDP respectively, although the latter has somewhat decreased in the past decade. Gross capital formation (business investment), by contrast, has decreased: while it accounted for almost 30% of GDP in 1970, it now accounts for only approximately 20%. This can be explained by companies saving more and investing less in the country. While German companies are generally known as forward-thinking, high-tech businesses, this lack of corporate expenditure at home might be because companies do not see Germany as a promising future market and therefore prefer to invest in countries where they see more growth opportunities (e.g., in Asia). In addition, the environment for starting and operating a company in Germany has been widely criticized, which might also explain why corporate investment is decreasing. While corporate investment has decreased, net exports have increased considerably since 2000. German companies are well known for quality, especially in the electrical, engineering, chemical and vehicle-construction industries. As people in many nations such as China become richer, these high-quality goods become more attractive and affordable, which might explain the rise in net exports.
For India, the pattern is very different. While household expenditure made up more than 70% of GDP at the beginning of the measurement period, this proportion has decreased to less than 60% in the past decade. This significant decrease is mirrored by a similarly sharp increase in gross capital formation. This means that corporate investment has increased, most likely because India has experienced significant economic development in recent years and is becoming a more promising market for companies and investors. While the share of government expenditure has remained relatively flat, there has been some movement in the proportion of net exports, which has, despite a recent increase, remained below zero, meaning that imports still exceed exports and India runs a trade deficit.
Similar to India, the United States also import more than they export, as their net exports are negative. However, this trend has intensified only in the past 20 years, with net exports amounting to approximately -5% of GDP, whereas they hovered around zero between 1970 and 2000. While this trade deficit probably has different drivers than India’s (namely that consumers and households want more choice and can afford to import goods from abroad), it has raised concerns among many Americans, not only Donald Trump. It is also worth highlighting that the United States are the only one of the three countries to see a constant decrease in the proportion of government expenditure in GDP, which might be explained by attempts to reduce government debt. This decrease is matched by an increase in the proportion of household expenditure. While US government expenditure as a proportion of GDP is still higher than India’s, at less than 15% it is significantly lower than Germany’s at around 20%. Comparing corporate investment in the US to Germany shows that, while the proportions are similar at around 20%, the United States have seen a positive trend over recent years whereas Germany has seen a negative one. This might indicate that conditions for running a business in the US are better or improving, giving companies an incentive to grow and develop their businesses in the United States.
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else? Yes